[Maintain][Manifest] flip All/Any/CountNonzero reduction ops to implemented by lcy-seso · Pull Request #1431 · tile-ai/TileOPs

lcy-seso · 2026-05-12T03:18:09Z

Summary

Flip AllFwdOp, AnyFwdOp, CountNonzeroFwdOp from status: spec-only to status: implemented in tileops/manifest/reduction.yaml.
Single-file manifest-only change; no signature/roofline/workloads/shape_rules edits.

Context

Third of a sibling set:

[Maintain][Manifest] flip reduction Sum/Mean/Amax/Amin to implemented #1428 — issue [Maintain][Manifest] flip arithmetic reductions to implemented #1399, Sum/Mean/Amax/Amin.
[Maintain][Manifest] flip Var/Std/VarMean fwd ops to status: implemented #1430 — issue [Maintain][Manifest] flip variance reductions to implemented #1400, Var/Std/VarMean.
This PR — issue [Maintain][Manifest] flip boolean/count reductions to implemented #1401, All/Any/CountNonzero.

After all three merge, the remaining 7 spec-only reduction ops flip to implemented, bringing the reduction family to 19/19 status: implemented.

Note on AC-5

AC-5 in the linked issue says "All 13 reduction.yaml ops are status: implemented after merge." The literal count is stale — reduction.yaml currently holds 19 ops (12 already implemented on upstream/main, 7 spec-only before this sibling set). The substantive intent — "reduction family fully implemented after merge" — is satisfied once #1428, #1430, and this PR all land.

Test plan

AC-1: tileops/manifest/reduction.yaml shows status: implemented for AllFwdOp, AnyFwdOp, CountNonzeroFwdOp.
AC-2: python scripts/validate_manifest.py exits 0.
AC-3: pytest tests/test_validate_manifest.py passes.
AC-4: PR diff touches only tileops/manifest/reduction.yaml (3 lines changed).
AC-5: Reduction family fully implemented after the three sibling PRs merge (literal "13" is a stale count; see note above).

lcy-seso · 2026-05-12T03:18:27Z

/gemini review

Copilot

Pull request overview

Promotes the boolean/count reduction ops (AllFwdOp, AnyFwdOp, CountNonzeroFwdOp) to status: implemented in the reduction manifest, making them discoverable/dispatchable via the manifest now that implementation + conformance coverage exists.

Changes:

Flip AllFwdOp manifest status from spec-only to implemented.
Flip AnyFwdOp manifest status from spec-only to implemented.
Flip CountNonzeroFwdOp manifest status from spec-only to implemented.

gemini-code-assist

Code Review

This pull request updates the status of AllFwdOp, AnyFwdOp, and CountNonzeroFwdOp to 'implemented' in the reduction manifest. Feedback highlights several inconsistencies between the manifest definitions and the actual Python implementations, specifically regarding default dimension values, shape rules for empty dimension lists, and missing supported data types in the input signatures.

gemini-code-assist · 2026-05-12T03:20:17Z

  ref_api: "torch.all"
  family: reduction
-  status: spec-only  # dim=None not yet implemented; update when impl lands
+  status: implemented


While flipping the status to implemented is correct given the implementation provided, there is a discrepancy between the manifest and the Python implementation in tileops/ops/reduction/all_op.py. The manifest defines the default for dim as null (None), which matches torch.all behavior (full reduction). However, the implementation in all_op.py (line 65) defaults to -1. This inconsistency should be resolved to ensure the op behaves as specified when used via the manifest-driven API or when instantiated without arguments.

Verified — AllFwdOp.__init__ (tileops/ops/reduction/all_op.py:65) defaults dim=-1, manifest says null. The code-only prereq PR will flip the default to None (same change planned for _ReduceOpBase, _WelfordReduceOp, AnyFwdOp, CountNonzeroFwdOp).

gemini-code-assist · 2026-05-12T03:20:17Z

  ref_api: "torch.any"
  family: reduction
-  status: spec-only  # dim=None not yet implemented; update when impl lands
+  status: implemented


Similar to AllFwdOp, there is a discrepancy between the manifest default for dim (null) and the implementation in tileops/ops/reduction/any_op.py (line 65), which defaults to -1. The implementation should be updated to default to None to align with the manifest and torch.any semantics.

Confirmed — same root cause as AllFwdOp. AnyFwdOp.__init__ (tileops/ops/reduction/any_op.py:65) defaults dim=-1. Code-only prereq PR fixes it.

gemini-code-assist · 2026-05-12T03:20:17Z

  ref_api: "torch.count_nonzero"
  family: reduction
-  status: spec-only  # dim=None not yet implemented; update when impl lands
+  status: implemented


There are a few inconsistencies in the manifest for CountNonzeroFwdOp that should be addressed now that it is marked as implemented:

Default Value Discrepancy: The manifest defaults dim to null (line 739), but the implementation in tileops/ops/reduction/count_nonzero.py (line 72) defaults to -1.

Shape Rule Inconsistency: The shape_rules (line 742) use and len(dim) > 0 for the list/tuple check, which causes dim=[] to be treated as a full reduction (returning a scalar). In contrast, AllFwdOp and AnyFwdOp treat dim=[] as a no-op (returning the input shape). Given that torch.count_nonzero does not support dim=[], it should either be forbidden or follow the no-op convention of the other logical reductions.

Missing Dtype: The inputs signature (line 735) is missing bool, which is supported by both PyTorch and the provided implementation in count_nonzero.py.

Item-by-item verdict:

Default dim mismatch (real) — CountNonzeroFwdOp.__init__ (tileops/ops/reduction/count_nonzero.py:72) defaults dim=-1. Will be fixed in the code-only prereq PR.

shape_rules empty-list inconsistency (real, manifest-only) — CountNonzero uses _empty_dim_policy="full", while All/Any inherit "reject". The shape_rules text for CountNonzero / All / Any does need to be unified with the runtime policy and aligned to PyTorch (torch.count_nonzero rejects dim=[]). Outside the status-flip carve-out — separate manifest-only PR.

Missing bool dtype (real, manifest-only) — signature dtype list will be expanded to include bool (and int32/int64/complex for All/Any/CountNonzero where the impl supports them via _pre_kernel fp32 promotion) in the same manifest-only follow-up.

gemini-code-assist

Code Review

This pull request updates the status of AllFwdOp, AnyFwdOp, and CountNonzeroFwdOp from spec-only to implemented in the reduction manifest. However, the reviewer identified several critical issues that make this change premature. Specifically, a kernel bug in logical_reduce.py could lead to out-of-bounds reads due to missing row masking. There are also inconsistencies between the manifest and the Python implementation concerning default dimension values and the list of supported data types.

gemini-code-assist · 2026-05-12T03:22:00Z

  ref_api: "torch.all"
  family: reduction
-  status: spec-only  # dim=None not yet implemented; update when impl lands
+  status: implemented


Marking AllFwdOp as implemented is premature due to several issues that should be addressed to ensure consistency and correctness:

Kernel Bug: In tileops/kernels/reduction/logical_reduce.py, the non-tiled kernel _logical_reduce_kernel (lines 114-123) only performs row masking if _needs_pad (column padding) is true. If the hidden dimension N is aligned (e.g., 256) but the number of rows M is not a multiple of block_m, the T.copy at line 123 will perform an out-of-bounds read.

Default Value Discrepancy: The manifest defines the default dim as null (full reduction) at line 647, whereas the Python implementation in all_op.py defaults to -1 (last-dim reduction) at line 65. This inconsistency can lead to unexpected behavior in manifest-driven tools or benchmarks.

Supported Dtypes: The manifest signature (line 643) only lists float16 | bfloat16 | float32 | bool, but the implementation in all_op.py (lines 47-48) and the kernel logic explicitly support int32, int64, and complex types via pre-conversion to float32.

Item-by-item verdict:

Kernel out-of-bounds on aligned-N path — verified false positive. Ran compute-sanitizer --tool memcheck on AllFwdOp with M=129, N=256, block_m=8 (the smallest tile config; would expose any tail-row OOB) plus 15 other shapes (M ∈ {129, 257, 513, 1025} × N ∈ {128, 256, 512, 1024}). Zero errors, output canary intact, numerical results match torch.all. TileLang's T.copy(x[a, b], shared) lowering inserts implicit row-bounds guards from the source tensor's static extent T.Tensor[(M, N), dtype], so the manual if pid_m*block_m + i < M is redundant. The "else" branch is safe. No kernel fix needed.

Default dim=-1 ↔ manifest null (real) — will be fixed in the code-only prereq PR.

Missing int32 / int64 / complex in signature dtype list (real, manifest-only) — the impl does promote those via _pre_kernel. The signature dtype list will be expanded in the follow-up manifest-only PR (outside the status-flip carve-out).

gemini-code-assist · 2026-05-12T03:22:00Z

  ref_api: "torch.any"
  family: reduction
-  status: spec-only  # dim=None not yet implemented; update when impl lands
+  status: implemented


Marking AnyFwdOp as implemented is premature for the same reasons as AllFwdOp:

Kernel Bug: The non-tiled kernel in logical_reduce.py lacks row masking when N is aligned but M is not a multiple of block_m, leading to potential out-of-bounds reads in T.copy.

Default Value Discrepancy: The manifest default for dim is null (line 693), while the Python default is -1 (line 65 of any_op.py).

Supported Dtypes: The manifest signature (line 689) should be updated to include int32, int64, and complex types supported by the implementation.

Same verdicts as the AllFwdOp block:

Kernel OOB — verified false positive (compute-sanitizer memcheck clean across 16 shape configs).

Default dim — real; code-only prereq PR.

Dtype list — real, manifest-only follow-up.

gemini-code-assist · 2026-05-12T03:22:00Z

  ref_api: "torch.count_nonzero"
  family: reduction
-  status: spec-only  # dim=None not yet implemented; update when impl lands
+  status: implemented


Marking CountNonzeroFwdOp as implemented is premature due to the following issues:

Kernel Bug: Similar to the other logical reduction ops, the non-tiled kernel in logical_reduce.py is missing row masking for aligned N, causing unsafe T.copy operations when M % block_m != 0.

Default Value Discrepancy: The manifest default for dim is null (line 739), while the Python default is -1 (line 72 of count_nonzero.py).

Supported Dtypes: The manifest signature (line 735) is missing bool, int32, int64, and complex types, all of which are supported by the implementation and are common use cases for count_nonzero.

Same verdicts as the AllFwdOp / AnyFwdOp blocks for CountNonzeroFwdOp:

Kernel OOB — verified false positive (compute-sanitizer memcheck clean).

Default dim — real; code-only prereq PR.

Dtype list (bool, int32, int64, complex) — real, manifest-only follow-up.

Ibuki-wind

Overall

The logical-reduction status flip is not yet contract-true; fix the op defaults and kernel tail handling before promoting these entries.

Ibuki-wind · 2026-05-12T03:38:14Z

  ref_api: "torch.all"
  family: reduction
-  status: spec-only  # dim=None not yet implemented; update when impl lands
+  status: implemented


LogicalReduceKernel still has an unmasked tail-row read/write path in _logical_reduce_kernel: when N is aligned, the else branch does T.copy(x[pid_m * block_m, 0], shared_buf) without checking pid_m * block_m + i < M, so the last block can read past the end of x and write past the end of out for shapes like M=129, N=512 -> keep these entries spec-only or mask every load/store before promoting them.

Verified false positive after empirical check. Ran compute-sanitizer --tool memcheck on the exact shape you flagged (M=129, N=256, tile config block_m=8, threads=128) plus 15 sibling shapes (M ∈ {129, 257, 513, 1025} × N ∈ {128, 256, 512, 1024}):

0 memcheck errors across all configs.

8 KiB input canary (zero-initialized region adjacent to x) intact after every run.

Numerical results match torch.all / torch.any / torch.count_nonzero on each shape.

Reason: TileLang lowers T.copy(x[pid_m * block_m, 0], shared_buf) against the source tensor's statically-declared extent T.Tensor[(M, N), dtype]. The lowered loads carry implicit row-bounds predicates derived from M, so the explicit if pid_m*block_m + i < M is unnecessary — the codegen already inserts equivalent guards. The DSL reads as "raw CUDA" but the safety wrapper sits one level lower.

If you have a counter-example shape that triggers the OOB I'd happily re-run sanitizer on it, but on the configs the kernel actually compiles to for these ops, there's no real-world OOB to fix.

Ibuki-wind · 2026-05-12T03:38:14Z

  ref_api: "torch.all"
  family: reduction
-  status: spec-only  # dim=None not yet implemented; update when impl lands
+  status: implemented


AllFwdOp still defaults dim to -1 in __init__ and inherits _empty_dim_policy = "reject", but the manifest entry advertises dim: null and treats dim=[] as a no-op, so manifest-driven callers that omit dim or pass an empty sequence do not get the promoted semantics -> change the op default and empty-dim handling to match the manifest before marking this implemented.

Confirmed on the default-dim side — AllFwdOp.__init__ (tileops/ops/reduction/all_op.py:65) defaults dim=-1, manifest says null. Code-only prereq PR will fix it on All/Any/CountNonzero plus the two base classes.

On _empty_dim_policy = "reject" vs the manifest's "treat dim=[] as no-op" shape_rules — PyTorch's torch.all(x, dim=[]) actually errors, so "reject" is the right semantics. The manifest shape_rules text is the side that's wrong (matches CountNonzero's "full" policy rather than All/Any's "reject"). That's a manifest-only fix and will be bundled into the manifest follow-up (outside the status-flip carve-out).

Ibuki-wind · 2026-05-12T03:38:14Z

  ref_api: "torch.any"
  family: reduction
-  status: spec-only  # dim=None not yet implemented; update when impl lands
+  status: implemented


AnyFwdOp still defaults dim to -1 in __init__ and inherits _empty_dim_policy = "reject", but the manifest entry advertises dim: null and treats dim=[] as a no-op, so manifest-driven callers that omit dim or pass an empty sequence do not get the promoted semantics -> change the op default and empty-dim handling to match the manifest before marking this implemented.

Confirmed same as the AllFwdOp thread — AnyFwdOp.__init__ (tileops/ops/reduction/any_op.py:65) defaults dim=-1. Code-only prereq PR will flip it. The _empty_dim_policy = "reject" ↔ manifest shape_rules disagreement also matches AllFwdOp's analysis — PyTorch errors on dim=[] so "reject" is correct; manifest shape_rules need a follow-up fix.

Ibuki-wind · 2026-05-12T03:38:14Z

  ref_api: "torch.count_nonzero"
  family: reduction
-  status: spec-only  # dim=None not yet implemented; update when impl lands
+  status: implemented


CountNonzeroFwdOp still defaults dim to -1 in __init__, while the manifest entry advertises dim: null; workload-driven and benchmark code that omit dim will reduce the last axis instead of a full-tensor reduction -> switch the op default to None or leave the entry spec-only.

Confirmed — CountNonzeroFwdOp.__init__ (tileops/ops/reduction/count_nonzero.py:72) defaults dim=-1, manifest says null. Code-only prereq PR will fix it (along with All/Any and the two base classes).

lcy-seso · 2026-05-12T05:01:51Z

Reviewer-feedback verification summary

After re-reading every comment on PRs #1428 / #1430 / #1431 and running targeted checks, the verified picture is:

Item	Verdict	Action
Constructor default `dim=-1` vs manifest `default: null` (Sum/Mean/Amax/Amin/Var/Std/VarMean/All/Any/CountNonzero)	✅ Real	Code-only prereq PR — flip default to `None` on `_ReduceOpBase`, `_WelfordReduceOp`, `AllFwdOp`, `AnyFwdOp`, `CountNonzeroFwdOp`; add explicit `dim=-1` override on `ProdFwdOp` (its manifest type is `int, default: -1`).
`_logical_reduce_kernel` tail-row OOB on aligned-N path (All/Any/CountNonzero)	❌ False positive	`compute-sanitizer --tool memcheck` clean across 16 shape configs (M ∈ {129, 257, 513, 1025} × N ∈ {128, 256, 512, 1024}); TileLang's `T.copy` inserts implicit row-bounds guards from the source extent. No fix needed.
`self.dtype.itemsize` `AttributeError` (gemini)	❌ False positive	`torch.dtype.itemsize` exists since torch 2.1; `pyproject.toml` pins `torch>=2.1.0,<2.11.0`.
`x.reshape(M, N)` non-contiguous risk after `flatten_for_multidim` (gemini)	❌ False positive	`flatten_for_multidim` already does `.permute(perm).contiguous()` before reshape.
`_reshape_output` not handling `tuple` (gemini)	❌ False positive	`normalize_dim` always returns `list[int]`; `tuple` is unreachable through the public API.
`shape_rules` `dim=[]` semantics, missing dtypes in signature, `hidden-state-reduce` workload missing `dim`	⚠️ Real, manifest-only	Outside the manifest-trust-model status-flip carve-out. Follow-up manifest-only PR will fix all three families together.

Plan:

Open a code-only prereq PR ([Fix][Ops] reduction op constructor defaults: dim=-1 → dim=None) covering the 5 default flips + ProdFwdOp override + default-construction test updates.
Once that lands, rebase [Maintain][Manifest] flip reduction Sum/Mean/Amax/Amin to implemented #1428 / [Maintain][Manifest] flip Var/Std/VarMean fwd ops to status: implemented #1430 / [Maintain][Manifest] flip All/Any/CountNonzero reduction ops to implemented #1431 and the status-flip claims become contract-true.
Separately, file a manifest-only PR (signature dtype lists, shape_rules empty-list semantics, hidden-state-reduce workload) for the items the trust-model carve-out won't allow here.

Detailed per-comment verdicts are posted as inline replies on each thread.

…mented AllFwdOp, AnyFwdOp, CountNonzeroFwdOp now satisfy their spec via logical_reduce kernel + dedicated op files; drop the stale "dim=None not yet implemented" comment along with the status flip. Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <Ibuki-wind@users.noreply.github.com>

Ibuki-wind

Overall

The status flip overclaims implementation conformance; scalar input support is still missing for the flipped PyTorch logical/count reduction APIs.

Ibuki-wind · 2026-05-13T07:44:37Z

  ref_api: "torch.all"
  family: reduction
-  status: spec-only  # dim=None not yet implemented; update when impl lands
+  status: implemented


These entries still do not conform to the PyTorch API: the shared reduce path rejects scalar tensors (x.ndim == 0) with ValueError("Input tensor must be at least 1D"), while torch.all, torch.any, and torch.count_nonzero accept 0-D tensors. status: implemented is a hard conformance claim, so support scalar inputs and add coverage, or keep these entries spec-only / explicitly narrow the manifest contract.

Copilot AI review requested due to automatic review settings May 12, 2026 03:18

lcy-seso added maintain Ongoing monitoring, health tracking, and operational maintenance automated PR produced by an autonomous agent pipeline needs-review Awaiting human review before merge labels May 12, 2026

lcy-seso marked this pull request as ready for review May 12, 2026 03:18

lcy-seso requested a review from a team May 12, 2026 03:18

Copilot started reviewing on behalf of lcy-seso May 12, 2026 03:18 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 12, 2026

View reviewed changes

Ibuki-wind requested changes May 12, 2026

View reviewed changes

This was referenced May 12, 2026

[Maintain][Manifest] flip reduction Sum/Mean/Amax/Amin to implemented #1428

Open

[Maintain][Manifest] flip Var/Std/VarMean fwd ops to status: implemented #1430

Open

This was referenced May 12, 2026

[Fix][Reduction] align op defaults and empty-dim policy with manifest spec #1434

Closed

[Fix][Reduction] align constructor dim defaults + dim=[] noop on All/Any #1435

Merged

lcy-seso force-pushed the fix/manifest/issue-1401 branch from b1056ae to bde01bd Compare May 12, 2026 12:21

Ibuki-wind requested changes May 13, 2026

View reviewed changes

Ibuki-wind mentioned this pull request May 13, 2026

[Manifest] align reduction status flips with scalar tensor behavior #1441

Closed

lcy-seso mentioned this pull request May 13, 2026

[Fix][Reduction] support scalar tensor input on status-implemented reduction ops #1442

Open

7 tasks

Conversation

lcy-seso commented May 12, 2026

Summary

Context

Note on AC-5

Test plan

Uh oh!

lcy-seso commented May 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Ibuki-wind left a comment

Choose a reason for hiding this comment

Overall

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lcy-seso commented May 12, 2026

Uh oh!

Ibuki-wind left a comment

Choose a reason for hiding this comment

Overall

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers